Abstract. In the emerging eScience environment, repositories
of papers, datasets, software, etc., should be the foundation
of a global and natively-digital scholarly communications
system. The current infrastructure falls far short of this
goal. Cross-repository interoperability must be augmented to 
support the many workflows and value-chains involved in 
scholarly communication. This will not be achieved through 
the promotion of a single repository architecture or content
representation, but instead requires an interoperability framework
to connect the many heterogeneous systems that will
exist. 

We present a simple data model and service architecture 
that augments repository interoperability to enable scholarly 
value-chains to be implemented. We describe an experiment 
that demonstrates how the proposed infrastructure can be de- 
ployed to implement the workflow involved in the creation of 
an overlay journal over several different repository systems 
(Fedora, aDORe, DSpace and arXiv). 



1 Introduction 

The manner in which scholarly research is conducted is chang- 
ing rapidly. This is most evident in Science and Engineer- 
ing [42], but similar revolutionary trends are becoming ap- 
parent across disciplines [43]. Improvements in computing 
and network technologies, digital data capture techniques, 
and powerful data mining techniques enable research prac- 
tices that are highly collaborative, network-based, and data- 
intensive. Moreover, the notion of a unit of scholarly com- 
munication is changing fundamentally. Whereas in the paper 
world, the concept of a journal publication dominated the def- 
inition of a unit of communication, in the emerging eScience 
environment, units of communication are increasingly com- 
plex digital objects. The digital objects can aggregate datas- 
treams with both a variety of media types and a variety of 



intellectual content types, including papers, datasets, simula- 
tions, software, dynamic knowledge representations, machine 
readable chemical structures, etc. Repositories that host such
complex digital objects are appearing on the network at a 
rapid pace. 

In the light of these profound changes, we envision the 
emergence of a natively digital scholarly communication in- 
frastructure that has this wide variety of repositories as its 
foundation. This infrastructure would leverage the value of 
the digital objects in the underlying repositories by making 
them accessible for use and re-use in many contexts. In this 
infrastructure, repositories are not regarded as static nodes 
in a scholarly communication system that are merely tasked 
with archiving digital objects that were deposited there by 
scholars. Rather, repositories are perceived as the building 
blocks of a global scholarly communication federation in whi- 
ch each individual digital object can be the starting point of 
value chains with global reach. 

Implementation of this infrastructure brings up a variety 
of intriguing prospects and associated questions across the 
whole sociological-economical-legal-technical spectrum. In 
the Pathways project, a joint project between Cornell Univer- 
sity and the Los Alamos National Laboratory, we are explor- 
ing the technical problem domain. We focus on identifying 
and specifying the fundamental components required to fa- 
cilitate the emergence of a natively digital, repository-based 
scholarly communication system. Our research tries to find 
the appropriate level of cross-repository interoperability that 
will provide a sufficiently functional technical basis for the 
realization of the vision, and will stand a realistic chance of 
being implemented in existing and future repository systems. 

This work is important because the current level of cross- 
repository interoperability is inadequate to support advanced 
forms of communication. Different communities have followed
their own perspectives on repository design, implementation
and management as well as on digital object representation
and identification. Current interoperability is provided
mainly by support of the OAI-PMH [34] and its mandatory
Dublin Core metadata format [20]. Realizing the vision
will require significantly augmented cross-repository interop- 
erability. 

The remainder of this paper is organized as follows. Sec- 
tion 2 presents several motivating scenarios, and then sec- 
tion 3 describes related interoperability work. The follow- 
ing sections then introduce ideas for a cross-repository inter- 
operability framework that have resulted from the Pathways 
project. The proposed high-level requirements for participat- 
ing repositories can be summarized as follows: 

- Support for a shared data model for digital objects (sec- 
tion 4.1). 

- Support for a surrogate format that serializes the digital 
object in accordance with the data model (section 4.2). 

- Support for three core repository interfaces: obtain, har- 
vest and put, to allow dissemination and ingest of surro- 
gates (section 5). 

The proposed framework also requires a shared service 
registry that lists the network location of the core interfaces 
for participating repositories. Section 6 describes the service 
registry and possible format and semantic registries that would 
further empower the environment. In section 7 we describe 
experiments to implement an overlay journal scenario (an ex- 
ample we will use repeatedly throughout this paper) using 
this framework over existing repositories. Section 8 presents 
plans for future work, and section 9 draws some conclusions. 
A less technical exposition of these ideas is given in [16]. 

2 Motivating context 

In order to gain insights into the characteristics of the desired 
interoperability framework, it is helpful to investigate scenar- 
ios that drive this need for augmented interoperability. We 
see two classes of cross-repository value-chains: rich cross- 
repository services and cross-repository scholarly communi- 
cation workflows. 

In the first class of cross-repository value chains, repos- 
itories are regarded as sources of materials that can be used 
in services with a reach beyond the boundaries of a single 
repository. Materials should be exposed by repositories in a 
manner that allows for the seamless emergence of rich and 
meaningful services. Discovery services are an obvious ex- 
ample of this class, and, although support of the OAI-PMH 
has resulted in a suite of cross-repository discovery capabil- 
ities, their functionality remains limited. For example, imag- 
ine creating a special-purpose search engine that collects only 
machine-readable chemical structures, expressed using the
XML Chemical Markup Language (CML), contained in dig- 
ital objects hosted by repositories worldwide. The current in- 
teroperability environment provides neither the ability to ex- 
pose digital objects at a repository interface in a manner that 
unambiguously reveals the digital object's constituent datas- 
treams, nor the language to express their intellectual content 
type (e.g. chemical structure). As a result, the creation of 
the cross-repository chemical search engine would currently 



be truly complex, and would involve numerous repository- 
specific trial and error procedures. 

Consider the case where monitoring agencies make se- 
mantically tagged data on Arctic sea ice available in inter- 
operable repositories. An automated alerting service might 
then be able to discover and use both raw and processed data 
(with raw data provenance accurately indicated) to provide 
early warning of events such as the abrupt shrinkage in Arc- 
tic sea ice in 2005. The output might be a report, a new dig- 
ital object, containing both static 'snapshot' results and im- 
porting dynamically computed elements. Accurate version- 
ing of datasets would allow readers to be made aware of later 
amended inputs and perhaps even to recompute the results 
included in the report based on machine-actionable descrip- 
tions of the transform and visualization service. A newspaper 
article on the findings might reference the source reports al- 
lowing readers to delve into and understand the sources and 
the basis of the claims as far as their understanding permits. 

In the second class of cross-repository value chains, repos- 
itories are regarded as the basic building blocks of a digital 
communication system, and scholarly communication itself 
is seen as a global cross-repository workflow [18]. Digital 
objects contained in repositories are the subjects of the work- 
flows, and are used and re-used in many contexts. 

Citation is probably the most obvious example of this. 
In today's scholarly communication system, citation is im- 
plemented by inserting textual information describing a cited 
paper at the end of the citing paper, either by just typing it, by 
copy/pasting it from a Web page, or by importing metadata 
from a personal bibliographic citation tool. Thus, citations 
that are included in a digital manuscript are purely textual and 
are not natively machine readable or machine actionable. As 
a result, various post-factum approaches have been devised 
to connect citing paper to cited paper by means of hyper- 
links in the Web environment [13]. These approaches include 
fuzzy metadata-based citation matching [23], the DOI-based 
CrossRef linking environment [11], and the OpenURL frame- 
work for context-sensitive linking [14]. The variable quality 
of citation metadata, among other factors, means that none 
of these approaches is foolproof. Furthermore, it is challeng- 
ing to imagine how these approaches would extend beyond 
conventional scholarly papers, into the realm of complex dig- 
ital objects that contain datasets, simulations, visualizations 
and so forth. It is therefore intriguing to think about citation 
as the re-use of the cited digital object in the context of the 
citing digital object. 

To understand this expanded view of citation, imagine be- 
ing able to drag a machine readable representation of a digital 
object hosted by some repository, and to drop it into the citing 
object that, once finalized, is submitted into another reposi- 
tory. Now imagine being able to do the same for the citing ob- 
ject ad infinitum. Assuming that the machine readable repre- 
sentations that are being dragged and dropped contain the ap- 
propriate properties, the result would be a natively machine- 
traversable citation graph that would span across repositories 
worldwide. With appropriate user tools this would not only 
be vastly more functional than current forms of citation, but
also simpler to use and to manage. 

Collectively, these scenarios lead to a number of high- 
level observations: 

Long-term perspective — Scholarly communication is a 
long-lasting endeavor, and, as a consequence, a long-term 
perspective should inspire the thinking about a future dig- 
ital scholarly communication infrastructure. Clearly, this 
yields requirements related not just to the longevity of 
repositories and their collections, but also to the inter- 
operability framework. The framework should be defined 
with sufficient abstraction to allow implementation using 
different technologies as time goes by, and should not be 
tied to a specific type of identifier, but rather support all 
current and future identification systems. 

Content-transfer is often unnecessary — Most of the value 
chains illustrated in the above scenarios do not require the 
transfer of all digital object content. Instead, they need just a
subset appropriate to the particular value chain. For example, the
citation scenario requires only the transfer of the biblio- 
graphic metadata of the cited paper, whereas the search 
engine scenario only requires the transfer of the chemical 
formula. Full content-transfer as required for repository 
mirroring is just one of many use cases that should be 
enabled by a desired solution. 

Fine grained identification — Identifiers of journal articles, 
such as DOIs, are typically repository independent in the 
sense that copies of a paper with a given identifier stored 
by multiple repositories share the same public identifier. 
This level of identification granularity is sufficient for ci- 
tation purposes. However, it becomes inadequate when 
trying to record the chain of evidence for cross-repository 
value chains because these have a specific digital object 
from a specific repository as their subject. This means that 
a finer level of identification granularity is required than 
provided by the existing bibliographic infrastructure. 

3 Related work 

Pathways is focused on defining a common data model and 
service interfaces. These are designed to enable re-use and 
re-combination of digital objects and their components, to fa- 
cilitate workflows over distributed repositories, and to enable 
computation and transformation of digital objects with dy- 
namic service linkages. A key aspect of this work is that it 
explicitly handles the notion of provenance or lineage when 
content is re-used. 

A significant amount of work exists in the design and 
specification of data models for digital objects, and in the 
creation of XML representation formats to promote the inter- 
operable transmission and exchange of digital objects. XML 
representation formats include the Metadata Encoding and 
Transmission Standard (METS) [37], the MPEG-21 Digital 
Item Declaration Language [8], the IMS Content Packaging 
XML Binding [26], and the XML Formatted Data Unit (XF- 
DU) [45]. Many of these formats have been used to enable the 



transfer of digital assets among systems. A notable example 
is the use of MPEG-21 DIDL in the transfer of the American 
Physical Society collections to Los Alamos National Labora- 
tory [6]. 

There is no doubt that multiple data models for complex 
objects exist and will continue to be favored by different com- 
munities. The challenge is to develop a simple and flexible 
overlay data model that does not depend upon asset transfer, 
and can accommodate the essence of these different content 
models, yet can provide a simple low-barrier entry point for 
interoperability among repositories. 

The Content Object Repository Discovery and Registra- 
tion/Resolution Architecture (CORDRA) [33] is similar to 
Pathways in its goal to provide an open, standards-based mo- 
del for repository interoperability. However, CORDRA is pri- 
marily focused on enabling interoperability between learning 
object repositories via federated registries of metadata cat- 
alogs. Unlike Pathways, CORDRA is specified upon a land- 
scape of authored metadata. The Pathways data model is built 
upon a generic, graph-based abstraction that does not prescribe
specific metadata other than a small set of key attributes
for objects. While CORDRA offers support for retrieval of 
content, Pathways addresses both retrieval and write for com- 
plex objects among heterogeneous repositories. Finally, while 
a distributed name resolution system (e.g., the handle system) 
is a necessary architectural pillar in CORDRA, the Pathways 
identifier scheme does not depend on a digital object
identifier resolution service shared by distributed repositories.

There are a number of other projects in the higher edu- 
cation community devoted to the goal of repository interop- 
erability within service-based architectures. Similarly moti- 
vated work is being done with the EduSource Community 
Layer (ECL) [21], the DLF Asset Action Experiments [19], 
and the Open Knowledge Initiative Open Service Interface Definitions
(OSIDs) [40]. These projects each specify a middleware
service layer to enable applications to be built over heterogeneous
data sources. Pathways is distinguished in that it
is focused on defining a minimal set of read/write services 
necessary to enable access and re-use of complex objects in 
a distributed, heterogeneous repository environment. The in- 
tent of Pathways is to specify relatively lightweight services 
that are easy to deploy over existing architectures. At the 
same time, Pathways is motivated to provide a model that 
can record and exploit provenance relationships as content is 
re-used across different services. 

The challenge of service-based repository interoperabil- 
ity is being taken up by many other communities, often with 
different definitions of the basic concept of a "repository". 
In terms of service interfaces and APIs, repository interop- 
erability is being addressed both from an access perspective 
and an authoring perspective (i.e., write, put). Many efforts 
are positioned around a limited view of "content", typically 
individual content byte streams (an image, a web page, a PDF 
document), or hierarchies of content byte streams with simple 
descriptors. 
For example, there are many new services that position
the web as both a readable and writable space, albeit in a 
limited manner. Atom [3] provides an API for an application 
level protocol for publishing and editing web resources. It 
also provides an XML data format that can be used in both 
the syndication and authoring of content. The Amazon S3 [2] 
web service provides an interface to support reading and writ- 
ing, ultimately providing an internet data storage service that 
is scalable, reliable, and fast. SRW Search/Retrieve and Up- 
date [41] defines a web-service interface for retrieving and 
updating metadata records. Web-based Distributed Authoring
and Versioning (WebDAV) [44] enables web servers
to be exposed as writable, in addition to readable, by providing
an interface for uploading content using a file and directory
paradigm. Each of these services shares with Pathways
the notion of simple web-based interfaces for creating and 
accessing content over the web. However, a key distinction 
of Pathways is its focus on complex digital objects as units 
of content as compared to single-content byte streams (e.g., 
a file). Another distinction of the Pathways work is that it is 
primarily intended to be an interoperability model for man- 
aged repositories, as distinguished from more nebulous stor- 
age services on the open web. 



Pathways employs a graph or tree-based data model to 
overlay heterogeneous data sources, which is also the basis 
of several other efforts. Recently, JSR 170 [30] has garnered 
much attention. This is a specification of a Java-based API 
for interacting with heterogeneous "content repositories" and 
repository-like applications in a uniform manner. The basic 
metaphor for interaction is that of a hierarchy of nodes with 
properties, where node properties can be either simple data 
types or binary streams. JSR 170 is positioned similarly to 
how JDBC is for relational databases. It is most useful for 
developing Java-based applications with a standard interface 
for connecting with content storage components (i.e., "con- 
tent repositories"). Since it is not web services oriented, it is 
not clear what impact it could have in providing interoperability
among distributed institutional repositories, and in non-Java 
environments. 



The Pathways framework is intended to be consistent with 
existing and emerging web architecture principles and should 
be easily implemented using existing web protocols and standards.
In considering the W3C recommendation for the Architecture
of the World Wide Web [28], the Pathways framework
has been influenced by the need for URI-based identifiers
for resources, the notion that resources can have one
or more "representations", and that these representations can 
be sent or received via simple protocols. Pathways is influ- 
enced by work in the semantic web community, particularly 
the Resource Description Framework (RDF) [32] as a mo- 
del for expressing resources in a graph-oriented manner as 
resource nodes with property and relationship arcs. 



4 Digital objects, the Pathways Core data model, and 
surrogates 

The goal of our work on data models and interfaces is the 
creation of an interoperability layer, as indicated in figure 1. 
We expect that this layer will overlay data models and service 
interfaces that are distinct to individual repository implemen- 
tations. These repository-specific models and interfaces may 
provide functionality outside and above the models and in- 
terfaces described here, which are intended to represent the 
intersection (rather than union) of individual repository fea- 
tures. 

We use the following definitions: 

Digital object — In the manner of the seminal Kahn and 
Wilensky paper [31] we use the notion of a digital object 
to describe compositions of digital information. This is 
purposely abstract, and is not tied to any implementation 
or data model. The principal aspects of a digital object 
are digital data and key-metadata. Digital data can be any 
combination and quantity of individual datastreams, or 
physical streams of bits, and can consist of nested digital 
objects. Key-metadata, at a minimum, includes an identi- 
fier that is a key for service requests on the digital object 
at a service point. 

Data model — We describe a data model, the Pathways Core, 
that provides a formalization for overlaying digital ob- 
jects on a network of heterogeneous repositories and ser- 
vices. We use UML to describe this data model, but it 
could be described in other formalizations such as XML 
schema or OWL. 

Surrogate — We use the term surrogate to indicate concrete 
serializations of digital objects according to our data mo- 
del. The purpose of this serialization is to allow exchange 
of information about digital objects from one service to 
another and thus propagate them through value chains. 
We use RDF/XML for constructing our surrogates, be- 
cause it is useful for representing arbitrary sub-graphs. 

The primary goal of the data model, and consequently the 
surrogates that represent it on the wire, is not asset or content- 
transfer. Rather we have designed the data model primarily 
as a framework that describes the abstract structure of infor- 
mation objects, and the properties of that abstract structure 
such as lineage, identity, and semantics. The linkage in the 
model from the abstract structure to the physical content is 
by-reference rather than by-value containment. 
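
As a rough illustration of by-reference linkage in a surrogate, the following sketch serializes one entity and its datastreams as RDF/XML using only the Python standard library. The namespace URI, the element names hasFormat and hasLocation, and all example URLs are assumptions made for this sketch; they are not taken from a published Pathways vocabulary.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PC = "http://example.org/pathways-core#"  # hypothetical namespace, for illustration only

ET.register_namespace("rdf", RDF)
ET.register_namespace("pc", PC)


def surrogate(entity_uri, datastreams):
    """Serialize an entity as RDF/XML; datastreams are (format, location) pairs.

    Only the location URL is written, never the bits themselves,
    so the surrogate links to content by reference.
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    entity = ET.SubElement(root, f"{{{PC}}}Entity", {f"{{{RDF}}}about": entity_uri})
    for fmt, location in datastreams:
        link = ET.SubElement(entity, f"{{{PC}}}hasDatastream")
        ds = ET.SubElement(link, f"{{{PC}}}Datastream")
        ET.SubElement(ds, f"{{{PC}}}hasFormat").text = fmt
        ET.SubElement(ds, f"{{{PC}}}hasLocation", {f"{{{RDF}}}resource": location})
    return ET.tostring(root, encoding="unicode")


xml_text = surrogate(
    "http://repo.example.org/object/42",
    [
        ("application/pdf", "http://repo.example.org/object/42/paper.pdf"),
        ("application/x-tex", "http://repo.example.org/object/42/paper.tex"),
    ],
)
print(xml_text)
```

A consumer of such a surrogate sees the structure of the object and where each datastream can be requested, and remains free to decide whether to fetch any of them.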

There are a number of good reasons to not mandate as- 
set transfer in the interoperability fabric. Full asset transfer 
is necessary for only a subset of possible applications. A
notable example is preservation mirroring, and thus preservation
frameworks such as the Reference Model for an Open Archival 
Information System (OAIS) [39] include the notion of infor- 
mation packages that imply full transfer of information units. 
Many other applications such as the overlay journal example 
described later can be accommodated without the overhead 
of shipping all the bits between repositories and services. By 
supporting by-reference content, the Pathways Core model 
Fig. 1. Interoperability layer over heterogeneous repositories.



enables services to selectively decide when and if to derefer- 
ence and pull content into the service environment. This pro- 
motes the notion of "service-tuned" asset transfer, where each 
service can be configured to respond to by-reference content 
in a manner appropriate to the context. 
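
The idea of "service-tuned" asset transfer can be sketched as a service that inspects by-reference datastreams and selects only those it actually needs, such as the chemical search engine from section 2. This is a minimal sketch: the class layout, the format tag used for CML, and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set

CML = "chemical/x-cml"  # assumed format tag for Chemical Markup Language


@dataclass
class Datastream:
    format: str    # format tag of the datastream
    location: str  # URL at which a dissemination can be requested


@dataclass
class Entity:
    has_datastream: List[Datastream] = field(default_factory=list)
    has_entity: List["Entity"] = field(default_factory=list)


def locations_to_fetch(entity: Entity, wanted: Set[str]) -> Iterator[str]:
    """Yield locations of datastreams whose format the service cares about.

    Nothing is dereferenced here; the service decides later whether and
    when to pull the content behind each location.
    """
    for ds in entity.has_datastream:
        if ds.format in wanted:
            yield ds.location
    for child in entity.has_entity:
        yield from locations_to_fetch(child, wanted)


structure = Entity([Datastream(CML, "http://repo.example.org/obj/1/structure.cml")])
paper = Entity(
    [Datastream("application/pdf", "http://repo.example.org/obj/1/paper.pdf")],
    [structure],
)
to_fetch = list(locations_to_fetch(paper, {CML}))
print(to_fetch)
```

In this example the PDF is never requested; only the location of the chemical structure is handed on for retrieval.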

In a number of cases, full asset transfer is forbidden or 
undesirable. For example, a rights holder may be willing to 
allow inclusion of their asset in another context by means of 
reference through a surrogate, but may be unwilling to trans- 
fer the datastream itself. Or, the rights holder might allow the 
asset transfer if the assets are placed in some digital rights 
management (DRM) wrapper. 

Finally, static transfer of an asset may be undesirable in 
the case of dynamic information objects, such as data sets 
derived from sensor networks. We foresee a number of appli- 
cations in the scholarly domain where such dynamic objects 
are desirable, such as astronomy publications that include the 
latest sky survey data. 

4.1 Pathways Core data model 

The Pathways Core data model is based on the notion of a 
graph of abstract entities with concrete datastreams as leaves. 
In this model, a digital object is a sub-graph rooted at an en- 
tity. The data model is designed to meet the following re- 
quirements: 

1. It permits recursion for arbitrary levels of entity contain- 
ment. 

2. It provides an explicit link to the concrete representation, 
or component datastreams, of the digital object. 

3. It includes a notion of object identity that is independent 
of specific identifier schemes. 

4. It expresses lineage among objects, providing evidence of 
derivation and workflow among objects. 



5. It accommodates the linkage of semantic tags to informa- 
tion entities that extend the functionality of format tags to 
the domain of complex, multi-part objects. 

6. It allows the maintainer of the object to assert persistence 
of the availability of a surrogate. 

A UML structure diagram of the Pathways Core is shown 
in figure 2. The correspondence of features of the model to the 
requirements list above is indicated by the numbered proper- 
ties. Each feature of the model is explained in more detail in 
the following sections. Our goal has been to find the minimal 
set of features necessary, the core properties. Certain uses or 
applications may require refinement of these relationships or 
the addition of new relationships, and we believe that such 
extensions can be added without breaking the core function- 
ality. 

4.1.1 Entity recursion 

At the root of the Pathways Core is the notion of an en- 
tity. As shown in figure 2, this is the attachment point of a 
set of properties that associate the entity with its required 
and optional features. One property is hasEntity, which ex- 
presses recursive containment of entities. This maps to the 
Kahn/Wilensky [31] notion that digital objects can contain 
nested digital objects. An example of the utility of this re- 
cursive relationship is modeling of an overlay journal. In this 
case, a top level entity could represent the journal itself, with 
semantic, persistence, and identity attributes that correspond 
to the journal. A journal "contains" issues, which themselves 
may be entities, with associated properties. This recursion 
naturally continues, with issues "containing" articles. 

As indicated in figure 3, an entity is an abstract concept, 
distinct from concrete datastreams described in the next sec- 
tion. This abstract/concrete distinction is fundamental to the 
Fig. 2. UML diagram of the Pathways Core data model with parts that fulfill particular requirements numbered.



model — removing assertions of identity, persistence, lin- 
eage, and semantics from individual physical manifestations 
of intellectual objects. This separation of abstract and con- 
crete properties (or attributes) is similar to that in the FRBR 
model [38]. 

4.1.2 Concrete representation 

As indicated in figure 2, an entity can have several hasDatas- 
tream properties. The motivation for this is well-established 
in compound document formats such as METS, MPEG-21 
DIDL, and Fedora FOXML, which allow a single object to 
have multiple datastreams with different media types (e.g., 
the availability of a scholarly paper in PDF, Word and TeX). 

A datastream has both a format (e.g. a format registered 
in GDFR [1] or PRONOM [12]) and a location, a URL to 
request a dissemination of the datastream. The datastream as- 
sociation is intentionally by-reference rather than by-value, to 
avoid mandating asset transfer for the reasons given earlier. 

A typical digital object will contain one or more datas- 
treams. The digital object represented in figure 3 comprises 
a top-level entity E1, with sub-entities E2 and E3. The entity
E2 has two datastreams, D1 and D2, which might be alternate
expressions of the entity E2. As semantic assertions appear
only at the level of the entity, both D1 and D2 are assumed to
have any semantics expressed for E2. The entity E3 has just
a single datastream D3 and may thus be used to express semantics
that apply just to D3, as separate from the semantics
of E1. For example, E1 might be a "journal issue" with two
articles E2 and E3, and the article E2 happens to be available 
in both PDF and Word formats. 
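
The journal issue example can be written down as a small in-memory structure, with a top-level issue entity (labelled E1 here) containing two article entities (E2 and E3). This is only a sketch of the abstract/concrete split. The Python attribute names mirror hasEntity and hasDatastream, while the semantic labels, the MIME types standing in for registered formats, and the URLs are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Datastream:
    format: str    # placeholder for a registered format identifier
    location: str  # URL from which a dissemination can be requested


@dataclass
class Entity:
    label: str
    semantics: List[str] = field(default_factory=list)
    has_datastream: List[Datastream] = field(default_factory=list)
    has_entity: List["Entity"] = field(default_factory=list)


def walk(entity: Entity, depth: int = 0) -> Iterator[Tuple[int, Entity]]:
    """Yield (depth, entity) for an entity and everything nested in it."""
    yield depth, entity
    for child in entity.has_entity:
        yield from walk(child, depth + 1)


# E2 is available in two alternate formats; E3 in one.
e2 = Entity("E2", ["article"], [
    Datastream("application/pdf", "http://repo.example.org/e2.pdf"),
    Datastream("application/msword", "http://repo.example.org/e2.doc"),
])
e3 = Entity("E3", ["article"], [
    Datastream("application/pdf", "http://repo.example.org/e3.pdf"),
])
e1 = Entity("E1", ["journal issue"], has_entity=[e2, e3])

for depth, ent in walk(e1):
    formats = [d.format for d in ent.has_datastream]
    print("  " * depth + f"{ent.label} {ent.semantics} {formats}")
```

Semantics are attached to entities only, so both datastreams of E2 inherit the "article" label without repeating it.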




[Figure 3: abstract, recursive entities linked to concrete datastreams.]

Fig. 3. Entity recursion and concrete representation.



4.1.3 Identity 

We recognize the reality that one identifier technology will 
never dominate and have thus incorporated two notions of 
identity. First, the hasIdentifier property allows expression of
URIs associated with a digital object, a DOI for example.
Second, the hasProviderInfo property introduces a relatively
simple repository-centric identifier paradigm which permits 
precise identification of digital objects in the particular repos- 
itory, facilitating re-use and accurate provenance records. This 
paradigm is not intended to replace existing identifier mech- 
anisms or to interfere with future technologies in this area. 
Rather it is intended as a future-proof long-term scheme that 
can co-exist with other identifier mechanisms. 

The hasProviderInfo property has three components:
provider — The identity of the repository (i.e., the service
point providing access and ancillary services on the digital
object). We assume that the participants in this infrastructure,
institutional repositories and the like, have
a commitment to persistence of their repository identity. Indirection
via a repository identifier presumes some technology for 
registering repositories and resolution to the location of 
their service interfaces. Registries are discussed later, in 
section 6. 

preferredIdentifier — The identity of the entity within the
repository. This serves as the key for making service re- 
quests upon the digital object at the service point defined 
by the repository (provider) identity. As explained later in 
this paper, the basic repository service is a request for a 
surrogate of the digital object. We expect, however, that a 
host of other services will evolve. We emphasize that the 
syntax, semantics, and resolution of the identity of the ob- 
ject is local to the individual repository, rather than being 
global as in more ambitious identifier schemes. 

versionKey — This is a means of parameterizing a service 
request on an object according to version semantics. The 
intention here is to provide an opaque hook into individ- 
ual repository versioning implementations, rather than as- 
suming or imposing some universal cross-repository ver- 
sion schema. 

Two copies of the same object in two different repositories
may have the same identifier expressed via the hasIdentifier
property. However, they will have different providerInfo
because they are available from different repositories. 
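
The two notions of identity can be contrasted in a short sketch: two copies of one paper share a public identifier but differ in provider information. The field names follow the three components listed above. The class layout, the comparison helpers, and every identifier value are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProviderInfo:
    provider: str                      # identity of the repository
    preferred_identifier: str          # key that is local to that repository
    version_key: Optional[str] = None  # opaque hook into repository versioning


@dataclass(frozen=True)
class Identity:
    has_identifier: Tuple[str, ...]    # URIs such as a DOI
    has_provider_info: ProviderInfo


def same_work(a: Identity, b: Identity) -> bool:
    """True if the two copies share at least one public identifier."""
    return bool(set(a.has_identifier) & set(b.has_identifier))


def same_copy(a: Identity, b: Identity) -> bool:
    """True only if repository, local key and version all match."""
    return a.has_provider_info == b.has_provider_info


copy_a = Identity(
    ("info:doi/10.9999/example.1",),
    ProviderInfo("info:example/repoA", "oai:repoA:1234", "v2"),
)
copy_b = Identity(
    ("info:doi/10.9999/example.1",),
    ProviderInfo("info:example/repoB", "item/98765"),
)

print(same_work(copy_a, copy_b))  # True
print(same_copy(copy_a, copy_b))  # False
```

The second comparison is the one needed for provenance records, since it distinguishes a specific object in a specific repository.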

4.1.4 Lineage 

Isaac Newton wrote "If I have seen further it is by standing on 
the shoulders of Giants" [27]. In the face of massive changes 
in scholarship since Newton's time, one constant is the evolu- 
tion of scholarship, whereby new results are built on the inno- 
vations of earlier scholars. We believe therefore that the inter- 
operability infrastructure must support the notion of lineage, 
natively linking entities to other entities from which they are 
derived. 

As shown in figure 2, entities in the model can link to 
other entities through the hasLineage relationship. This linkage
leverages the hasProviderInfo identity of the entity (or
entities) from which the new entity derives, thus allowing an 
entity to express its derivation from another entity and specif- 
ically state both the repository origin of the source object and 
its version semantics. Furthermore, since the model is recur- 
sive, entities can contain entities and the derivation of con- 
tained parts of objects can be similarly expressed. 

This lineage capability is illustrated in figure 4. The entity
labeled E1 is derived from that labeled E2. For example,
E1 may be a translation of E2 into a new language. E2, as
illustrated, contains sub-entities with respective derivations from
E3 and E5. For example, E2 may be an issue of an overlay
journal with articles that are edited versions of the preprints
E3 and E5, where E5 is itself a sub-entity of the preprint series
E4. These cases illustrate re-use at different granularities.




Fig. 4. Relating entities by lineage. 



The result of these lineage links among entities at the
interoperability layer is a web of evidential citation. This graph
indicates both the workflow origins of an information object
(the partial ordering of information objects from which it
derives) and the curatorial heritage of the object (the
repositories and services responsible for its legacy). This
new, uniquely networked and digital form of citation
provides a finer level of identification than conventional
bibliographic citation. In the case that a repository has an object
derived from an object in another repository, there is a local
choice as to whether the same object identifier is used or a
new one generated. This choice would presumably be
influenced by repository policy, community agreements and by
the kind of value chain implemented. In either case, two
observations can be made. First, the providerInfo includes the
provider, which makes the complete identification unique and
distinguishes the objects. Second, the hasLineage property of
the derived entity provides an unambiguous link back to the
original entity.

Both cases are illustrated in figure 5, which shows the
entity labeled E1 taking part in value chains that result in
new entities, E2 and E3, in different repositories. In all cases
the entities are uniquely identified by the providerInfo, even
though E2 has the same preferredIdentifier as E1. Also, both
E2 and E3 indicate their lineage from E1 with the providerInfo
extracted from E1.

[Figure 5 (surviving labels): repository 3; providerInfo={repo3,id2}; hasLineage={repo1,id1}]

Fig. 5. Identification and lineage of derived entities in different repositories.
The shorthand {provider,preferredIdentifier} is used for providerInfo, and
versionKey is omitted.

We imagine that the hasLineage relationship is a super- 
class of the many types of inter-entity derivation relationships 
that could be expressed. Thus, future evolution of the infras- 
tructure might refine this relationship. 
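
To make the lineage graph concrete, the following sketch follows hasLineage links back to all earlier entities. It uses the figure 5 shorthand of (provider, preferredIdentifier) pairs. The repository names, identifiers, the fourth entity, and the function name are placeholders chosen for illustration.

```python
from typing import Dict, List, Tuple

# Shorthand from figure 5: (provider, preferredIdentifier); versionKey omitted.
Key = Tuple[str, str]

# hasLineage links: each entity maps to the entities it derives from.
has_lineage: Dict[Key, List[Key]] = {
    ("repo1", "id1"): [],
    ("repo2", "id1"): [("repo1", "id1")],  # same preferredIdentifier reused
    ("repo3", "id2"): [("repo1", "id1")],  # new identifier generated
    ("repo4", "id9"): [("repo3", "id2")],  # a further, hypothetical derivation
}


def ancestors(entity: Key) -> List[Key]:
    """Return every entity reachable by following hasLineage links."""
    seen: List[Key] = []
    stack = list(has_lineage.get(entity, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another derivation path
        seen.append(current)
        stack.extend(has_lineage.get(current, []))
    return seen
```

Because each link carries the provider as well as the identifier, the walk distinguishes the two entities that share the preferredIdentifier id1.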

4.1.5 Semantics 

We envision applications that need to know about the "se- 
mantic" composition of digital objects in addition to know- 
ing the media-format types of the individual datastreams. A 
complex digital object might represent a "dissertation" or a 
"journal article", each of which might have datastreams that 
are images, data sets, spreadsheets, or text in various formats. 

One particularly interesting application is service matching.
The utility of automated matching of preservation services
to information objects has been demonstrated by the PANIC
work [24]. While PANIC demonstrates the utility of automation
for individual datastreams based on media type, we would
like to enable similar services over complex objects and based 
on intellectual content types. 

The Pathways Core therefore associates the hasSeman- 
tic property with each entity. The target of this property is 
a URI specifying the semantic typing of the entity. Admit- 
tedly, no universal semantic registry exists at this time. How- 
ever, the property could be exploited by individual commu- 
nities that develop local schemes, and later extended to more 
widespread use. 

4.1.6 Persistence 

The history of persistence of information artifacts, especially 
digital objects, is riddled with examples of the gaps between 
intention, expectation, and reality. Despite our best intentions 
to provide storage of and access to digital information
"forever" (or even a few months!), the realities of hardware
failures, format rot, and mismanagement frequently interfere. This
must be considered in the design of any information interop- 
erability framework. 

Therefore, we have taken a purposely modest approach 
to persistence that is oriented towards surrogates and ser- 
vices over surrogates, rather than towards digital objects. The 
hasProviderPersistence property associated with an entity is 
a slot in which the repository can declare, by means of a 
URI, the longevity of its commitment towards providing ser- 
vices over the respective entity. The repository making this 
commitment is identified as the provider in the entity's
hasProviderInfo property. Since the core service in the
interoperability fabric is the dissemination of a surrogate for the entity,
hasProviderPersistence indicates the level of commitment of 
the respective repository to provide access to a surrogate for 
the entity. While there is clearly scope for subtle refinement 
of persistence declaration, at this point we propose a set of 
just two persistence declarations: 

- The entity is transient and the repository makes no com- 
mitment to providing services for it over time. 

- The entity is persistent and the repository intends to re- 
spond to service requests for it over time. 

4.2 Surrogates and serialization 

An individual instance of the Pathways Core data model, a 
representation of an individual digital object, is packaged and 
transmitted as a surrogate: a serialization that conforms to the 
data model. We note possible terminological confusion here 
but have not found a word with less baggage. By surrogate 
we mean a serialization that substitutes for the digital object 
and must therefore reveal all essential characteristics, and is 
thus distinguished from some arbitrary representation. The 
obtain and harvest interfaces (described in sections 5.1 and
5.2) provide the means for clients to request a surrogate.
Similarly, a put request (described in section 5.3), which
requests deposit of a digital object in a repository, contains a 
surrogate as a payload. 

We have found that RDF [32] is a useful tool for mod- 
eling the graph-like structure of information in the Pathways 
Core. We have done this by associating URIs with the prop- 
erties in the Pathways Core and similarly associating URIs 
with a number of controlled vocabularies such as persistence, 
formats, and semantics that are the values of Pathways Core 
properties. RDF modeling naturally led to the adoption of the 
RDF/XML syntax [4] as the serialization syntax for Pathways 
Core surrogates. A fragment of an example of this syntax is 
shown in figure 6. 
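
As an illustration of this serialization, the sketch below builds a small surrogate with the Python standard library. The namespace URIs and element names follow figure 6. The function name, the entity URI, the DOI, and the repository identifier are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

CORE = "info:pathways/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("core", CORE)
ET.register_namespace("rdf", RDF)


def build_surrogate(about, semantic, identifier, provider, persistence):
    """Serialize a single-entity surrogate as RDF/XML text."""
    root = ET.Element(f"{{{RDF}}}RDF")
    entity = ET.SubElement(root, f"{{{CORE}}}entity",
                           {f"{{{RDF}}}about": about})
    ET.SubElement(entity, f"{{{CORE}}}hasSemantic",
                  {f"{{{RDF}}}resource": semantic})
    ET.SubElement(entity, f"{{{CORE}}}hasIdentifier").text = identifier
    ET.SubElement(entity, f"{{{CORE}}}hasProviderPersistence",
                  {f"{{{RDF}}}resource": persistence})
    has_info = ET.SubElement(entity, f"{{{CORE}}}hasProviderInfo")
    info = ET.SubElement(has_info, f"{{{CORE}}}providerInfo")
    ET.SubElement(info, f"{{{CORE}}}preferredIdentifier").text = identifier
    ET.SubElement(info, f"{{{CORE}}}provider").text = provider
    return ET.tostring(root, encoding="unicode")


xml_text = build_surrogate(
    about="info:pathways/entity/example",           # hypothetical entity URI
    semantic="info:pathways/semantic/journal-article",
    identifier="info:doi/10.1000/example",          # hypothetical DOI
    provider="info:sid/repo-a.example.org",         # hypothetical repository
    persistence="info:pathways/persistence/persistent",
)
```

Property values drawn from controlled vocabularies (semantics, persistence) appear as rdf:resource URIs, while identifiers are carried as literal text, mirroring the excerpt in figure 6.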

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:core="info:pathways/core#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <core:entity rdf:about="info:pathways/entity/info%3Asid%2Flibrary.lanl.gov%3Apathways/info%3Adoi%2F10.1016%2Fj.dyepig.2004.12.010">
    <core:hasSemantic rdf:resource="info:pathways/semantic/journal-article"/>
    <core:hasIdentifier>info:doi/10.1016/j.dyepig.2004.12.010</core:hasIdentifier>
    <core:hasProviderPersistence rdf:resource="info:pathways/persistence/persistent"/>
    <core:hasProviderInfo>
      <core:providerInfo>
        <core:preferredIdentifier>info:doi/10.1016/j.dyepig.2004.12.010</core:preferredIdentifier>
        <core:provider>info:sid/library.lanl.gov:pathways</core:provider>
      </core:providerInfo>
    </core:hasProviderInfo>
    <core:hasEntity>
      <core:entity rdf:about="info:pathways/entity/info... (shortened) ...lanl-repo%2Fssm%2Fdoi-10.1016%2Fj.dyepig.2004.12.010">
        <core:hasSemantic rdf:resource="info:pathways/semantic/bibliographic-citation"/>
        <core:hasIdentifier>info:lanl-repo/ssm/doi-10.1016/j.dyepig.2004.12.010</core:hasIdentifier>
        <core:hasProviderPersistence rdf:resource="info:pathways/persistence/persistent"/>
        <core:hasProviderInfo>
          <core:providerInfo>
            <core:preferredIdentifier>info:lanl-repo/ssm/doi-10.1016/j.dyepig.2004.12.010</core:preferredIdentifier>
            <core:provider>info:sid/library.lanl.gov:pathways</core:provider>
          </core:providerInfo>
        </core:hasProviderInfo>
        <core:hasDatastream>
          <core:datastream>
            <core:hasFormat rdf:resource="info:pathways/fmt/pronom/1000"/>
            <core:hasLocation>http://purl.lanl.gov/demo/adore-arcfile/00e682eb-a87eb27b0c79</core:hasLocation>
          </core:datastream>
        </core:hasDatastream>

Fig. 6. Excerpt from a sample surrogate that serializes the Pathways Core in RDF/XML.

5 Repository interfaces: obtain, harvest and put

We have described the Pathways Core data model and a
surrogate that serializes the model. For these to enable repository
interoperability, a set of essential services is required. Three
repository interfaces with the following functions fulfill this
need and are described below:

- An obtain interface which, in its most basic implemen- 
tation, allows the request of a surrogate for an identified 
digital object from a repository. 

- A harvest interface that exposes surrogates for incremen- 
tal collection or harvesting. 

- A put interface that supports submission of one or more 
surrogates into the repository, thereby facilitating the ad- 
dition of digital objects to the collection of the repository. 

5.1 Obtain interface

Pathways defines an obtain interface that supports the request 
of services pertaining to an identified digital object within 
a repository. The simplest implementation of the obtain in- 
terface allows requesting a surrogate for an identified digital 
object from a repository. Such an interface can be regarded as 
an identifier-to-surrogate resolution mechanism that resolves 
the preferredIdentifier of a digital object into a surrogate of
that digital object. 

The information needed to construct an obtain request is
recorded in the providerInfo property of the surrogate itself.
The providerInfo is a triple consisting of the identifier of the
repository that exposes the surrogate, the preferredIdentifier
of the digital object and an optional versionKey. By using the
identifier of the repository, the location of the obtain interface
of the identified repository can be found by a look-up in
a service registry (see section 6). Once known, one can use
the preferredIdentifier of the digital object (and the optional
versionKey) to obtain a surrogate using the repository's
obtain interface.

Higher levels of the obtain functionality have been ex- 
plored theoretically by Bekaert [5]. Straightforward exten- 
sion of the obtain concept allows the request of any supported 
service pertaining to an identified digital object. This includes 
the request of services pertaining to datastreams of the digital 
object. Possible examples are requests to obtain a surrogate 
of an identified article, requests to obtain a PDF datastream 
of that same article, requests to obtain an audio version of 
that article by applying a text-to-speech service upon the PDF 
datastream, and so forth. Such services can be considered a 
superclass of the basic obtain functionality described above, 
and do not have to be supported by all repositories. Rather, 
such services would typically be supported by autonomous 
service applications that overlay one or more repositories and 
use surrogates that are obtained through interaction with the 
core obtain interface of the underlying repositories. 

One technology that lends itself to implementing these
obtain interfaces is the OpenURL Framework for
Context-Sensitive Services [46]. The OpenURL standard originates
from the scholarly information community, where it was
proposed as a solution to the provision of context-sensitive
reference links for scholarly works such as journal articles and
books [14]. The initial standard was generalized to create
the current NISO OpenURL Framework, which describes a
networked service environment in which packages of context
information (ContextObjects) are used to request
context-sensitive services pertaining to a referenced resource. Each
ContextObject contains various types of information that are
needed to provide context-sensitive services. Such information
may include the identifier of the referenced resource (the
Referent), the type of service that needs to be applied to the
Referent (the ServiceType), the network context in which the
resource is referenced, and the context in which the service
request takes place.

In this way, the core obtain interface can be implemented 
as an OpenURL Framework Application. The ContextObject 
used in the obtain request conveys the following information: 

A Referent — The digital object for which an obtain request 
is formulated. The Referent is described by means of its 
preferredIdentifier.

A ServiceType — The service that generates a surrogate of 
the identified digital object. 
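
A ContextObject of this kind could, for example, be carried on an HTTP GET request using the key/value form of OpenURL, in which rft_id identifies the Referent and svc_id identifies the ServiceType. The sketch below is one possible encoding, not the encoding used in the experiments. The base URL, the svc_id value, and the function name are assumptions, and the optional versionKey is left out because the text does not say how it would be conveyed.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def obtain_url(base_url, preferred_identifier, svc_id):
    """Build an OpenURL-style GET request for a surrogate."""
    params = [
        ("url_ver", "Z39.88-2004"),        # OpenURL 1.0 version tag
        ("rft_id", preferred_identifier),  # the Referent
        ("svc_id", svc_id),                # the ServiceType
    ]
    return base_url + "?" + urlencode(params)


url = obtain_url(
    "http://repo-a.example.org/openurl",  # hypothetical obtain interface
    "info:doi/10.1000/example",           # hypothetical preferredIdentifier
    "info:pathways/svc/obtain",           # hypothetical service identifier
)
query = parse_qs(urlsplit(url).query)
```

A Requester could be added in the same way through the req_id key if context-sensitive responses are wanted.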

Beyond meeting our basic requirements, the OpenURL 
Framework has the following attractive properties: 

- it makes a clear distinction between the abstract defini- 
tion of concepts and their concrete representation and the 
protocol by which such representations are transported. A 
ContextObject may be represented in many different for- 
mats and transported using many different transport pro- 
tocols, as technologies evolve. Yet, the concepts underly- 
ing the OpenURL Framework persist over time. 

- it does not make any presumptions about the identifier 
namespace used for the identification of digital objects (or 
constituents thereof), and hence, provides for an obtain 
interface that can be implemented across a broad variety 
of repository systems. 

- it allows information about the context in which the ob- 
tain request took place to be conveyed. This information 
may allow delivery of context-sensitive service requests. 
Of particular interest is information about the agent re- 
questing the obtain service (the Requester). This infor- 
mation could convey identity, and this would allow re- 
sponding differently to the same service request depend- 
ing on whether the requesting agent is a human or ma- 
chine. Similarly, different humans could receive different 
disseminations based on recorded preferences or access 
rights. The OpenURL Framework is purposely generic
and extensible, and would also support conveying the
characteristics of a user's terminal, the user's network
context, and/or the user's location via the Requester entity.
Although this type of context-related tuning may not be
important when requesting surrogates of digital objects,
it may prove to be essential when requesting rich services
pertaining to datastreams.

5.2 Harvest interface 

A harvest interface allows collecting or harvesting of surro- 
gates of digital objects. In addition to the facility to harvest 
all the surrogates exposed by a repository, we believe it is 
necessary to provide a facility allowing some forms of selec- 
tive harvesting. The simplest, and perhaps most useful, form 
of selective harvesting is to allow downstream applications 



to harvest surrogates only for those digital objects that were 
created or modified after a given date. This echoes the Open 
Archives Initiative Protocol for Metadata Harvesting (OAI- 
PMH) [34] with the same motivation: downstream applica- 
tions may need an up-to-date copy of all the surrogates from 
a repository in order to provide some service, and incremen- 
tally harvesting surrogates of newly added or modified digital 
objects is an efficient way to do this. 

A harvest interface could be implemented using various 
technologies such as the OAI-PMH, RSS or Atom, or with 
a subset of more complex technologies such as SRU/SRW. 
The OAI-PMH is a well established harvesting technology 
within the digital library community and allows aggregation 
of metadata from compliant repositories using a datestamp- 
based harvesting strategy. Although the OAI-PMH was first 
conceived for metadata harvesting, it can be used to transfer 
any metadata or data format, including complex-object for- 
mats, expressed in XML according to an XML Schema [17]. 
The OAI-PMH is thus capable of providing the harvest func- 
tionality, and the ability to leverage existing OAI-PMH im- 
plementations is a significant benefit. 

To support the harvest interface, the underlying OAI-PMH 
interface must follow these conventions: 

- Each OAI-PMH item identifier must match the
preferredIdentifier of the Pathways Core digital object. This avoids
the need for clients to record relationships between
OAI-PMH identifiers and digital object identifiers, which can
become complex in various aggregation scenarios.

- The OAI-PMH datestamps must be the datetime of cre- 
ation or modification of the digital objects as discussed 
in [17]. 

- It must provide a metadata format for surrogates as de- 
scribed in section 4.2. 
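
Under these conventions, a harvester issues ordinary OAI-PMH requests. The sketch below builds a selective ListRecords request and a GetRecord request keyed by preferredIdentifier. The verbs and argument names are those of OAI-PMH; the base URL, the metadataPrefix value "pathways", and the function names are assumptions for illustration.

```python
from datetime import date
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, since=None):
    """Request all surrogates, or only those changed on or after a date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since is not None:
        params["from"] = since.isoformat()  # day granularity, YYYY-MM-DD
    return base_url + "?" + urlencode(params)


def get_record_url(base_url, metadata_prefix, preferred_identifier):
    """Request one surrogate; the OAI-PMH item identifier is the
    preferredIdentifier, as the first convention above requires."""
    params = {
        "verb": "GetRecord",
        "identifier": preferred_identifier,
        "metadataPrefix": metadata_prefix,
    }
    return base_url + "?" + urlencode(params)


incremental = list_records_url(
    "http://repo-a.example.org/oai",  # hypothetical harvest interface
    "pathways",                       # hypothetical metadataPrefix
    since=date(2006, 1, 31),
)
```

A full harvester would also follow resumptionToken values in the responses, which this sketch omits.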

It is worth noting one possible issue. The OAI-PMH spec- 
ification is bound to the HTTP protocol and the XML syntax 
for transporting and serializing the harvested records. While 
this approach proves to be satisfactory in the current techno- 
logical environment, it may prove to be inadequate as tech- 
nologies evolve. If this work were to be tightly bound with 
the OAI-PMH then an abstract model would need to be cre- 
ated. However, if OAI-PMH is used simply as one possible 
technology to implement harvest functionality then it could 
later be replaced. 

5.3 Put interface 

Pathways defines a put interface to promote interoperable trans- 
mission of surrogates to one or more target digital reposito- 
ries. As with the obtain interface, digital objects are expressed 
as surrogates. At the interface level, put operations are sim- 
ple and unassuming. They can be understood as a request for 
deposit of a digital object. This distinguishes the put inter- 
face from similar operations found in other push-oriented ser- 
vices whose purpose is to facilitate upload of binary content 
streams, or transfer of assets using community-specific content
packages such as METS, IMS-CP, or MPEG-21 DIDL.

The put interface does not presuppose that target repositories
conform to any particular underlying storage scheme
(e.g., hierarchical file system, a web server with directories,
relational database, etc.). Additionally, the put interface is
neutral about the underlying data model of target repositories.
The only requirement is that digital objects be represented as 
surrogates expressed in the Pathways Core, which is specif- 
ically designed to transcend the particulars of heterogeneous 
data models. The graph-based nature of this model provides 
the flexibility to support the submission of both simple and 
complex digital objects. 

The put interface, in combination with a surrogate, is in- 
tended as a means for transmitting just enough information 
to enable a receiving repository to make decisions on how to 
process a surrogate — without anticipating or assuming an 
underlying repository's requirements for ingest. Datastream 
content is expressed by-reference in the surrogate, via the lo- 
cation property. With this constraint, a surrogate represents a 
"shallow copy" of a complex object since there is no trans- 
mission of raw content within the surrogate. As discussed 
earlier (section 4), this constraint is motivated by the need 
for simplicity, and the desire to keep authentication and au- 
thorization concerns out of the functional definition of the 
put interface. Authentication, authorization and policy are ex- 
pected to be handled at service implementation layers. 

Unlike the obtain and harvest interfaces, no protocol or 
technology stands out as an obvious implementation option 
for the put interface. 
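
Independent of the transport chosen, the receiving side of a put can be sketched as follows. This is an assumed, in-memory illustration of the "shallow copy" behavior described above, not a proposed protocol: the class name, the sample surrogate, and its identifiers are hypothetical, and authentication and policy checks are omitted.

```python
import xml.etree.ElementTree as ET

CORE = "{info:pathways/core#}"  # namespace used in figure 6

SURROGATE = """<rdf:RDF xmlns:core="info:pathways/core#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <core:entity rdf:about="info:pathways/entity/example">
    <core:hasProviderInfo>
      <core:providerInfo>
        <core:preferredIdentifier>info:doi/10.1000/example</core:preferredIdentifier>
        <core:provider>info:sid/repo-a.example.org</core:provider>
      </core:providerInfo>
    </core:hasProviderInfo>
    <core:hasDatastream>
      <core:datastream>
        <core:hasLocation>http://repo-a.example.org/files/article.pdf</core:hasLocation>
      </core:datastream>
    </core:hasDatastream>
  </core:entity>
</rdf:RDF>"""


class TargetRepository:
    """Hypothetical repository that ingests surrogates by reference."""

    def __init__(self):
        self.surrogates = {}  # source preferredIdentifier -> surrogate text

    def put(self, surrogate_xml):
        root = ET.fromstring(surrogate_xml)
        entity = root.find(f"{CORE}entity")
        if entity is None:
            raise ValueError("surrogate contains no core:entity")
        source_id = entity.findtext(
            f"{CORE}hasProviderInfo/{CORE}providerInfo/{CORE}preferredIdentifier"
        )
        if source_id is None:
            raise ValueError("surrogate carries no preferredIdentifier")
        # Datastreams are recorded by location only; nothing is fetched.
        locations = [el.text for el in entity.iter(f"{CORE}hasLocation")]
        self.surrogates[source_id] = surrogate_xml
        return source_id, locations


repo = TargetRepository()
stored_id, referenced = repo.put(SURROGATE)
```

Whether the referenced locations are later dereferenced and copied is left to the receiving repository, in line with the neutrality of the interface.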



6 Registries 

The proposed interoperability framework requires at least one 
supporting infrastructure component: a service registry to as- 
sociate providers with services. Additional format and se- 
mantic registries would significantly enrich the environment. 
No particular technical implementation is implied by the use 
of the term registry, but rather the general ability to record, 
share and retrieve terms of a controlled vocabulary alongside 
their associated properties. 

6.1 Service registry

A service registry is fundamental to the framework, as it fa- 
cilitates locating the core service interfaces of participating 
repositories. This registry has the identifier of a repository
(provider from providerInfo in the Pathways Core) as its primary
key, and it minimally stores the actual network location
of the obtain, harvest and put services, where supported.
Thus, given a surrogate with providerInfo (provider,
preferredIdentifier, versionKey), it is possible for an application to use
provider as a look-up key in the service registry to retrieve the
location of the core service interfaces for the repository
identified by provider. Once this information is available, actual
service requests can be issued against those interfaces. For
example, in order to retrieve an up-to-date surrogate, the
application can issue an obtain request using the
preferredIdentifier and optional versionKey of the providerInfo as shown in
figure 7. The use of a registry permits repository interfaces to
change their network location, allows different services
(including those not yet imagined) to be associated with a
repository, and makes the combination of repositories trivial.
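
The look-up described here amounts to a small keyed table. The following sketch assumes an in-memory registry; the entry layout, the repository identifier, the URLs, and the function name are illustrative only, since no particular registry technology is implied.

```python
from typing import Dict, NamedTuple, Optional


class ServiceEntry(NamedTuple):
    """Network locations of the core interfaces, where supported."""
    obtain: Optional[str] = None
    harvest: Optional[str] = None
    put: Optional[str] = None


# Primary key: the provider component of providerInfo.
registry: Dict[str, ServiceEntry] = {
    "info:sid/repo-a.example.org": ServiceEntry(
        obtain="http://repo-a.example.org/openurl",
        harvest="http://repo-a.example.org/oai",
    ),
}


def locate(provider: str, service: str) -> str:
    """Resolve a provider identifier to the location of one interface."""
    entry = registry.get(provider)
    if entry is None:
        raise KeyError(f"unknown provider: {provider}")
    location = getattr(entry, service)
    if location is None:
        raise LookupError(f"{provider} does not offer {service}")
    return location
```

If a repository moves its obtain interface, only its registry entry changes; surrogates that already carry its providerInfo remain valid.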

It should be noted that, in contrast to other repository 
federation approaches such as CORDRA [33], ADL-R [29], 
aDORe [15], and the Chinese Digital Museum Project [10], 
the proposed framework does not require a registry of all dig- 
ital objects in all contributing repositories thereby allowing 
location of a digital object given its identifier. In the proposed 
framework, a surrogate carries its self-identifying
providerInfo, which, through the intermediation of the service
registry, allows location of the service interfaces of the originating
repository. This approach alleviates two major drawbacks
inherent in the use of digital object registries. First, given an 
identifier, how does one know that it is an identifier of a dig- 
ital object from repositories contributing to the federation, 
and hence that a look-up in the federation's object registry is 
meaningful? It seems that this question can only be answered 
if all repositories in the federation share a common, recog- 
nizable identifier scheme. This is a significant requirement, 
especially in light of the considerations regarding the long- 
term horizon of desired solutions. Second, the scale of object 
registries is several orders of magnitude larger than that of the 
proposed service registry because the latter only contains an 
entry per repository, not per digital object. The repercussions 
for operating the registry infrastructure are obvious. 

6.2 Format and semantic registries 

While the service registry is essential for the operation of the 
proposed framework, two other registries, while less funda- 
mental, would significantly enrich the functionality of the an- 
ticipated environment. 

First, it is now widely recognized that repositories, espe- 
cially in preservation environments, must support more finely 
grained identification of digital media formats than is
provided by MIME types. Format registries that have the
identifier of a digital format as their primary key and that record
various properties of the format have been proposed by both
the PRONOM [12] and GDFR [1] efforts. Format identifiers
would be used for the format property available at the datas- 
tream level of surrogates. Such a fine level of format identifi- 
cation would, for example, enable rich format-based service 
matching as explored in the PANIC [24] and aDORe [7] ef- 
forts. 

Second, automated object use and re-use would be
enhanced by identification of the intellectual content type of
materials. A semantic registry that has the identifier of a
scholarly content type as its primary key and that records
various properties of the content type would support this. To
facilitate syndicating, aggregating, post-processing and
multi-purposing magazine, news, catalog, book, and mainstream
journal content, the PRISM effort [25] has created such a
vocabulary. No comparable vocabulary exists for materials
typically used in a scholarly context, which probably makes the
semantic registry more critical to pursue than the format
registry, for which MIME types can serve as a pragmatic stand-in.
Semantic identifiers would be used for the semantic property
available at the entity level of surrogates. Returning to the chemical
search engine scenario, appropriate semantic identification of
an entity would allow an agent to recognize it as a
machine-readable chemical formula, and thus choose to ingest the
associated datastreams, the format of which can also be
precisely described.

Fig. 7. Use of the service registry.



7 Experiments 



To test the ideas presented above, we created obtain, har- 
vest and put services to disseminate and ingest surrogates 
from and to several different repository architectures: Fedora, 
aDORe, DSpace and arXiv. We then used these interfaces to 
support the assembly of a number of articles from different 
repositories into a new issue of a hypothetical overlay jour- 
nal. Instead of relying just on the user interfaces of the partic- 
ipating repositories, we also created a resource-centric search 
service using the harvesting infrastructure provided by the 
harvest interfaces. To further enhance the demonstration, we 
combined these techniques with Live Clipboard [35] technol- 
ogy to allow surrogates to be moved among repositories via 
the usual drag-and-drop metaphor. We first describe the two 
parts of this experiment, and then discuss implementation is- 
sues and experiences. 



7.1 Harvesting journal articles to produce a 
resource-centric search service 

A number of projects have attempted to use OAI-PMH har- 
vested metadata for the creation of resource-centric or full- 
text discovery services. The principal problem is that resources 
cannot be unambiguously located from the simple Dublin 
Core metadata exposed by most OAI compliant repositories. 
This issue was discussed in detail in [17], where the use of 
complex object formats was proposed as a solution. While 
skeletal compared with formats such as METS and MPEG- 
21 DIDL, the surrogates proposed here also meet the require- 
ments for resource harvesting. We implemented a search ser- 
vice based on the Nutch [36] crawler and search service. In- 
stead of simply doing a web crawl, surrogates were harvested 
from the participating repositories using the harvest
interface. Each surrogate was then inspected to select
only those with the semantics "journal-article" (we agreed
on a small ontology for these experiments). All the appropriate
surrogates were examined to extract format and location
information to dereference the datastreams. The datastreams 
were then fetched and indexed while retaining their associa- 
tion with the surrogate. This process is illustrated in figure 8. 
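
The selection and extraction steps just described can be sketched as follows. This is not the code used in the experiment: the function name and the sample surrogate are assumptions, while the element names and the semantic and format URIs follow figure 6.

```python
import xml.etree.ElementTree as ET

CORE = "{info:pathways/core#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
JOURNAL_ARTICLE = "info:pathways/semantic/journal-article"

SAMPLE = """<rdf:RDF xmlns:core="info:pathways/core#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <core:entity rdf:about="info:pathways/entity/example">
    <core:hasSemantic rdf:resource="info:pathways/semantic/journal-article"/>
    <core:hasDatastream>
      <core:datastream>
        <core:hasFormat rdf:resource="info:pathways/fmt/pronom/1000"/>
        <core:hasLocation>http://repo-a.example.org/files/article.pdf</core:hasLocation>
      </core:datastream>
    </core:hasDatastream>
  </core:entity>
</rdf:RDF>"""


def datastreams_to_index(surrogate_xml):
    """Return (format, location) pairs for a journal-article surrogate,
    or an empty list if the top-level entity has other semantics."""
    root = ET.fromstring(surrogate_xml)
    top = root.find(f"{CORE}entity")
    if top is None:
        return []
    semantic = top.find(f"{CORE}hasSemantic")
    if semantic is None or semantic.get(f"{RDF}resource") != JOURNAL_ARTICLE:
        return []
    pairs = []
    for ds in top.iter(f"{CORE}datastream"):  # includes nested entities
        fmt = ds.find(f"{CORE}hasFormat")
        location = ds.findtext(f"{CORE}hasLocation")
        if location:
            fmt_uri = fmt.get(f"{RDF}resource") if fmt is not None else None
            pairs.append((fmt_uri, location))
    return pairs
```

Each returned location would then be fetched and indexed, with the index entry keyed back to the surrogate it came from.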

In addition to the usual links back to the source reposi- 
tory and content excerpt, the search results display was aug- 
mented with a Live Clipboard icon allowing the surrogate to 
be copied into the copy/paste buffer on the user's computer, 
and thus easily passed to other applications as described be- 
low. 

Fig. 8. Use of Nutch to create a resource-centric search service over repositories supporting the Pathways harvest interface.

7.2 Creation of a new issue of an overlay journal

The scenario we have referred to most frequently is the
composition of a new issue of an overlay journal from articles in
different repositories. When combined with the search
service just described, this scenario demonstrates the use of all
three repository interfaces in a realistic scholarly value-chain.
This scenario revolves around the editor of the overlay
journal, "Ed". The key data flows as Ed interacts with one source
repository are shown in figure 9, and the complete sequence
of actions required to create the new overlay journal issue is
described below.

1. Select: Ed applies whatever selection and review policies
the overlay journal uses to decide which articles should be 
included in the new issue. Ed selects three articles; one 
each from arXiv, from an aDORe based repository, and 
from a DSpace repository. 

2. Obtain: Consider first the selection of an article from 
arXiv. Ed navigates to the normal splash page for this arti- 
cle in whichever way is convenient, perhaps from Google, 
or from arXiv's own interface. The splash page not only 
displays the usual metadata, links to associated resources
and links to the full-text, but also a Live Clipboard icon 
as shown in figure 10. By clicking on this icon, the Live 
Clipboard JavaScript uses arXiv's obtain interface to get a 
surrogate for the article which is stored in the copy/paste 
buffer of Ed's computer. 

3. Compose: Ed then goes to the editorial web-interface for 
the overlay journal and pastes the surrogate via the Live 
Clipboard JavaScript on that page. Behind the scenes, the 
surrogate is put into the Fedora repository hosting the 
journal. Here it is a matter of local policy whether the
ingest mechanism simply stores the surrogate with
references to included entities and datastreams, or whether
these are dereferenced and also ingested. For this
demonstration we chose to ingest only the structural information
(the entities), which simulates a "pure" overlay that
simply links to articles in trusted repositories (perhaps
with cryptographic signatures to guarantee that the
original has not been altered). It would also be possible for the
ingest system to implement a deep copy and duplicate all
the datastreams of a digital object. Note that in a real
system Ed would have to authenticate with the repository for
the overlay journal in order to be granted the privileges to
put new content into the journal; presumably any attempt
to put content by a non-authenticated and/or unprivileged
user would be denied.

Complete composition: A similar process is repeated for 
articles from DSpace and from aDORe. Here Ed uses the 
search service described in 7.1. The search results show 
a Live Clipboard icon with each result. By clicking on 
this icon, the Live Clipboard JavaScript uses the search 
service's obtain interface to get a cached surrogate for the 
article which is stored in the copy/paste buffer of Ed's 
computer. The overlay journal issue then has three articles 
queued. 

Submit: When Ed is happy that all articles for the new 
issue are ready, the issue can be created as an entity in its 
own right by clicking the "Submit Issue" button. When 
this is complete, a surrogate for the new issue is available 
from the obtain and harvest interfaces of the overlay jour- 
nal repository and may be used by all the same services 
that intemperate with the underlying repositories. 

Fig. 9. Addition of an article from a participating repository to an overlay journal using Live Clipboard technology to copy a surrogate.

6. Visualize: To illustrate how other services can work within
this framework, and to allow easy visualization of
surrogates, we created an additional OpenURL-based service
to visualize the surrogate graph using WebDot [22].
Example output for an arXiv article containing an additional
data datastream is shown in figure 11. The OpenURL
request simply includes the providerInfo of the surrogate,
which is enough to enable the surrogate to be obtained and
rendered as an image of a graph with links to sub-entities,
datastreams and registry entries for format and semantic
URIs.
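
The visualization step can be approximated with a few lines of Python that emit Graphviz DOT source, the input format that WebDot renders. This is only a sketch under stated assumptions: the nested-dictionary layout of the surrogate and the key names (id, entities, datastreams, location, format) are hypothetical stand-ins for a parsed Pathways Core surrogate, not the representation used in the experiment.

```python
def _quote(text):
    """Quote a string so it can be used as a DOT identifier or label."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def surrogate_to_dot(root):
    """Return Graphviz DOT source for a nested surrogate dictionary.

    Assumed (hypothetical) layout: each entity is a dict with an "id",
    an optional list of child "entities" and an optional list of
    "datastreams", each with a "location" and a "format".
    """
    lines = ["digraph surrogate {"]

    def visit(entity):
        node = _quote(entity["id"])
        lines.append(f"  {node} [shape=ellipse];")
        for stream in entity.get("datastreams", []):
            target = _quote(stream["location"])
            label = _quote(stream["format"])
            lines.append(f"  {target} [shape=box, label={label}];")
            lines.append(f"  {node} -> {target};")
        for child in entity.get("entities", []):
            lines.append(f"  {node} -> {_quote(child['id'])};")
            visit(child)

    visit(root)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical article entity with a PDF and a dataset datastream.
article = {
    "id": "urn:example:article",
    "datastreams": [
        {"location": "http://example.org/paper.pdf", "format": "application/pdf"},
        {"location": "http://example.org/data.csv", "format": "text/csv"},
    ],
}
print(surrogate_to_dot(article))
```

The resulting text can be passed to any DOT renderer; entities are drawn as ellipses and datastreams as boxes labeled with their format.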

Though only a demonstration, the process of compiling a
new issue for an overlay journal described above uses many
interoperability features provided by the Pathways framework
that are simply not available in existing systems. By
implementing this over several of the most popular repository
technologies, we have demonstrated that this technology could
readily be deployed.

7.3 Implementation of the obtain and harvest services 

The obtain interface is the simplest service and was the first 
that we implemented for each repository. By choosing to base 
it on OpenURL we were able to leverage existing OpenURL 
implementations for some of the repositories, simply adding 
another service identifier (svc_id) for the obtain service. In
implementing an obtain interface, one must work with the
native data model of the underlying repository. The key
decisions arise in translation of digital objects from their
underlying representations to the Pathways Core model. Since the
Pathways Core model is flexible, this can be done in different 
ways, depending on how much visibility vs. encapsulation of 
component parts is desired. Parts that are made available for 
re-use should be modeled as entities with associated provider- 
Info. 
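
To make the shape of an obtain request concrete, the sketch below builds an OpenURL in key/encoded-value form. The keys url_ver, rft_id and svc_id are standard OpenURL 1.0 (Z39.88-2004) keys; the resolver address, the entity identifier and the value chosen for the obtain service identifier are hypothetical, since the values used in the experiment are not given here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical identifier for the obtain service (an assumption, not the
# value used in the experiment).
OBTAIN_SVC_ID = "info:example/svc/obtain"


def build_obtain_url(resolver_base, entity_id, svc_id=OBTAIN_SVC_ID):
    """Build an OpenURL asking a resolver for the surrogate of one entity."""
    params = {
        "url_ver": "Z39.88-2004",  # OpenURL 1.0 version key
        "rft_id": entity_id,       # the referent: the entity requested
        "svc_id": svc_id,          # the service requested: obtain
    }
    return resolver_base + "?" + urlencode(params)


# Hypothetical resolver and entity identifier.
url = build_obtain_url("http://example.org/resolver", "info:example/article-1")
print(url)
print(parse_qs(urlsplit(url).query))
```

Because the request is an ordinary HTTP GET, a repository that already runs an OpenURL resolver only needs to recognize one more service identifier to support it.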

All of the repositories used in this experiment already 
supported the OAI-PMH so the implementation of the harvest 
interface was simply a matter of adding another "metadata" 
format for the Pathways Core surrogate to the existing OAI- 
PMH interfaces. It is a requirement of the OAI-PMH that 
all metadata formats be expressed in XML according to an 
XML Schema. Thus, we created surrogates using RDF/XML
according to the W3C RDF/XML Schema [9]. Additionally,
we agreed that, for convenience, all repositories would use a
common metadataPrefix=pwc.rdf for this metadata format.

Fig. 10. Screenshot of the arXiv wrapper page augmented with Live Clipboard links.

Fig. 11. Screenshot of the graph visualization showing an article from arXiv that contains a PDF file and a dataset.
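
From the harvester's side, the harvest interface is a standard OAI-PMH ListRecords exchange with metadataPrefix=pwc.rdf. The following sketch builds the request URL and extracts the surrogates and the resumption token from a response. The base URL is hypothetical and error handling is omitted; the verb, argument names and XML namespace are those of OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request for Pathways Core surrogates."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "pwc.rdf"}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, surrogate_element)], resumption_token)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall("./oai:ListRecords/oai:record", OAI_NS):
        identifier = record.findtext(
            "./oai:header/oai:identifier", namespaces=OAI_NS
        )
        metadata = record.find("./oai:metadata", OAI_NS)
        # Deleted records carry no metadata element.
        has_payload = metadata is not None and len(metadata) > 0
        records.append((identifier, metadata[0] if has_payload else None))
    token = root.findtext(
        "./oai:ListRecords/oai:resumptionToken", namespaces=OAI_NS
    )
    return records, (token or None)


# Hypothetical repository base URL.
print(list_records_url("http://example.org/oai"))
```

A harvester would fetch the first URL, parse the response, and repeat with the returned resumption token until none is returned.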

7.4 Implementation of the put service 

While the put interface is agnostic to the particulars of un- 
derlying repository technology and models, any concrete im- 
plementation of a put service must be attentive to the specific 
capabilities and limitations of the underlying repository ar- 
chitecture. There are many questions that arise pertaining to 
how a put service interprets a surrogate and the assumptions 
the service makes in interacting with a particular underlying 
repository. 

To support the overlay journal experiment, a put service 
was developed to interact with a Fedora repository. As no ex- 
isting protocol already provided the required functionality, a 
new REST-based service for Fedora was created for our ex- 
periments. The flexibility of Fedora made it well suited for 
accepting and ingesting both simple surrogates (single en- 
tity) and complex surrogates (a graph of entities). However, 
this flexibility provoked the realization that there are a
number of issues to be considered in implementing an effective
put service, which we detail in the following sections.

7.4.1 Identifiers and lineage 

The put interface itself imposes no requirement on a
receiving repository in terms of how it should deal with
identifiers. However, whether or not the repository assigns new
identifiers for ingested surrogates, it is important that there
is a way to later determine that an entity is a new instance of
an existing entity. This means that the providerInfo of the
original surrogate should be retained in the hasLineage
property of the new entity as described in section 4.1.4. Thus
the providerInfo provides the basis for "a chain of lineage"
across multiple distributed repositories where each repository
represents the same entity or entities in different contexts.
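
The chain of lineage can be illustrated with a small sketch. It assumes, purely for illustration, that entities are dictionaries, that providerInfo can be treated as an opaque string, and that some lookup from providerInfo to entity is available; the function names are hypothetical and are not part of the put interface.

```python
def ingest_copy(source_entity, new_provider_info):
    """Create the receiving repository's entity for an ingested surrogate.

    The original providerInfo is retained in hasLineage, whatever
    identifier the receiving repository assigns to the new entity.
    """
    return {
        "providerInfo": new_provider_info,
        "hasLineage": source_entity["providerInfo"],
    }


def lineage_chain(entity, resolve):
    """List providerInfo values from this entity back to the earliest known.

    `resolve` maps a providerInfo to an entity, or returns None when the
    entity cannot be retrieved.
    """
    chain = [entity["providerInfo"]]
    current = entity
    while current is not None and current.get("hasLineage"):
        parent_info = current["hasLineage"]
        if parent_info in chain:  # guard against accidental cycles
            break
        chain.append(parent_info)
        current = resolve(parent_info)
    return chain


# Hypothetical identifiers for three repositories.
original = {"providerInfo": "arxiv:entity-1"}
overlay = ingest_copy(original, "journal:entity-7")
mirror = ingest_copy(overlay, "mirror:entity-3")
known = {e["providerInfo"]: e for e in (original, overlay, mirror)}
print(lineage_chain(mirror, known.get))
# prints ['mirror:entity-3', 'journal:entity-7', 'arxiv:entity-1']
```

In a distributed setting, resolve would issue an obtain request to the repository named in the providerInfo rather than consult a local dictionary.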

7.4.2 Ingesting hierarchies or networks of objects 

There are many cases when a put service will receive a
surrogate that models a hierarchy or graph of related entities.
This presents a challenge in terms of determining an appropriate
ingest policy for how surrogates will be processed, and what
kinds of digital objects will ultimately be created in a
receiving repository.

When surrogates contain a hierarchy of entities, some as- 
sumptions must be made as to the nature of the relationships 
of the entities in the hierarchy. Do parent-child relationships 
of a hierarchy imply a part-whole composition? Is the pres- 
ence or absence of each part essential to the integrity of the 
whole? Alternatively, is the hierarchy to be interpreted as a 
looser containment relationship, where the integrity of the 
whole is not compromised if its parts are disassociated? 

In our experiment, the put service assumed that any
entity within a surrogate that contained providerInfo should be
managed as its own digital object within the target Fedora
repository. Thus, in the case of the journal overlay example,
all journal, issue, and article entities were to be represented as
separate Fedora digital objects with appropriate relationships
asserted among them. Furthermore, any sub-entities of article
entities with providerInfo were also represented as digital
objects in their own right. In the experiments, there were
article entities that comprised both a document and datasets,
and each of these was represented as a separate digital object.
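
The ingest policy just described can be sketched as a traversal that plans one digital object per entity carrying providerInfo. The dictionary layout, the key names other than providerInfo, and the handling of entities without providerInfo (kept embedded in the nearest enclosing object) are assumptions made for illustration; this is not the Fedora put service itself.

```python
def plan_digital_objects(root):
    """Plan one digital object per entity that carries providerInfo."""
    objects = []

    def visit(entity, owner):
        info = entity.get("providerInfo")
        if info is not None:
            record = {
                "source": info,  # would be retained as lineage
                "parent": owner["source"] if owner else None,
                "embedded": [],  # labels of parts without providerInfo
            }
            objects.append(record)
            owner = record
        elif owner is not None:
            # Assumption: such parts stay inside the enclosing object.
            owner["embedded"].append(entity.get("label"))
        for child in entity.get("entities", []):
            visit(child, owner)

    visit(root, None)
    return objects


# Hypothetical surrogate for an issue with one article.
issue = {
    "providerInfo": "issue-1",
    "entities": [{
        "providerInfo": "article-1",
        "entities": [
            {"providerInfo": "document-1"},
            {"providerInfo": "dataset-1"},
            {"label": "cover note"},
        ],
    }],
}
for obj in plan_digital_objects(issue):
    print(obj)
```

Each planned record names its parent, so the relationships among the resulting digital objects can be asserted after the objects are created.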

The end result of creating a journal issue via the put
interface was the creation of a graph of related digital objects
in the target Fedora repository. From a management
perspective, this modular and atomic arrangement can enable
flexible management of objects. For example, it is easier to
discover and do something with all dataset objects than it would
be to find all types of objects that may encapsulate datasets.
From an access standpoint, each component is registered as a
digital object with its own public identifier and is available
for re-use. This approach made it possible to obtain the
entire journal, or any sub-part down the hierarchy, via the
obtain service in a simple and generic manner. It was not
necessary to create special services to discover and extract
entities that were encapsulated within other objects. On the
other hand, additional processes would be required to implement
facilities that depend on handling all the constituent parts of
a given digital object.



8 Future work 

Much of the intellectual effort in this work has been to pare
the Pathways Core data model down to its essential
components. Having successfully carried out some initial
experiments, we intend to explore other scenarios, including
those described in the introduction, to see where additional
richness is required. We envision extensions or refinements to
relationships in the model. One example is extending the notion
of entity containment to include more specific semantics:
distinction of part/whole vs. alternative or auxiliary;
indication of equivalence; and whether or not the contained
entities are ordered. In the overlay journal example, how and
where should the order of articles be expressed in a surrogate
for the journal issue?

The hooks from URI expression of semantic information
and format identifiers open up tantalizing ties to current ser- 
vice matching work, the semantic web, and to ontology-based 
reasoning. It seems likely that PRONOM and GDFR format 
registries will coexist and be used by different segments of 
the scholarly community. How can we use ontologies in sys- 
tems that will "understand" the equivalences and differences 
between these specifications? How can notions of generaliza- 
tion be applied to semantic information created in different 
contexts, and how can we service match on compositional 
semantics? For example, if an object contains a set of JPEG- 
2000 entities, should it be treated as a scanned book or a 
photo album? 

If we imagine a landscape of widespread re-use of digital
objects, there will undoubtedly be many copies and versions
in different repositories, and this provokes a number of
questions. When designing a put interface, how can one
determine whether a surrogate duplicates an existing digital
object? When a duplicate is discovered as part of a put request,
should the current object be replaced? Or should multiple
versions of the digital object be managed? These are
repository-specific decisions, but whatever is decided may have
significant implications for collaborative scholarly workflows.
How should surrogates be validated?

The experiments described here have been performed over
repositories that primarily hold document content. The
Pathways framework was conceived with a rich environment of
documents, data and other media-types in mind. Future work
will involve collaboration with other repositories, including
significant data repositories, and with other repository
architectures.



9 Conclusions 

Our experiments successfully demonstrated the ability to move 
surrogates of digital objects among repositories, and to re-use 
them in new contexts. The proposed interoperability frame- 
work allowed us to show how the basic workflow necessary 
to create a new issue of an overlay journal could be supported 
across heterogeneous repositories. The simplicity and gen- 
erality of the Pathways Core data model allowed its use to 
create surrogates for digital objects held in Fedora, aDORe, 
DSpace and arXiv repositories, each with significantly differ- 
ent internal data models and architectures. Furthermore, by 
leveraging existing implementations of OpenURL resolvers 
and OAI-PMH interfaces for the repositories, it was remark- 
ably easy to provide the dissemination (obtain) and harvest 
interfaces necessary for each repository to participate. The in- 
gest (put) interface was implemented only for a Fedora repos- 
itory and involved considerably more design decisions, many 
of which require further investigation to determine best prac- 
tices. 

These results are serving as the basis for further experi- 
ments with even more heterogeneous repository architectures, 
to include data repositories in particular. These experiments 
will implement other important value chains that are
necessary to move toward the goal of the creation of a global
scholarly communication system.